Summary of dataset
The structure and attributes of the dataset were explored. The heading of column 1 was changed from X to Wine ID to reflect its contents.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## [1] "Wine ID" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
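The steps behind the output above can be sketched roughly as follows. This is a minimal illustration, not the original code: the data frame name `wines` is an assumption, and the toy data holds only the first five rows of a few columns copied from the `str()` output, where the real analysis would load all 4898 rows from the source file.

```r
# Toy data frame built from the first five rows shown in the str() output.
# `wines` is a hypothetical name; the original object name is not shown.
wines <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32),
  quality          = c(6L, 6L, 6L, 6L, 6L)
)

str(wines)      # structure: observations, variables and their types
summary(wines)  # six-number summary for every column

# Rename the first column so its heading reflects its contents.
names(wines)[1] <- "Wine ID"
names(wines)
```

Because the new name contains a space, the column then has to be referenced as `wines[["Wine ID"]]` or with backticks rather than with a bare `$` name.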
Univariate Plots
In this section, each variable was plotted using histograms and boxplots to see how the white wines are distributed. The summary variable was explored first: a histogram of the experts’ ratings, as summarised in the variable quality, was plotted. The acidity attributes and pH of the white wines were plotted next, followed by the other chemical components. Before plotting, descriptive statistics for each variable were obtained to help choose bin widths and axis limits.
Histogram for visualising distribution of white wines by quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000

White wines with a quality above 6, and then those with a quality below 5, were counted.
##
## FALSE TRUE
## 3838 1060
##
## FALSE TRUE
## 4715 183
From the above distribution, it is clear that the majority of the white wines fall in the medium quality range of 5-6 (3655). Lower quality wines were fewer (183), as were higher quality wines (1060). The analysis in this project will focus on which variables contribute to this division and how they combine to create wines that can be categorised by quality.
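The two logical tables above can be reproduced with a short sketch. The `quality` vector here is rebuilt from the per-score counts reported in the ratings section of this report, which is an assumption made only so the example runs on its own; the original analysis would use the quality column of the full data frame.

```r
# Rebuild the quality scores from the per-score counts (3 to 9)
# reported later in this report.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

above_six  <- table(quality > 6)  # TRUE = higher quality wines
below_five <- table(quality < 5)  # TRUE = lower quality wines
medium     <- sum(quality %in% 5:6)

above_six
below_five
medium
```

The three groups add up to the 4898 wines in the dataset, which is a quick consistency check on the counts quoted in the text.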
Histograms for visualising distribution of acidity aspects of white wines
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820

Acidity is an important aspect of white wines, contributing to their taste and finish. pH measures overall acidity and, for most white wines in this dataset, lies between about 3 and 3.3 (the interquartile range is 3.09 to 3.28). Fixed acidity is a function of the grapes used for making the wine, while citric acid contributes to its crisp taste and finish. Volatile acidity is contributed by acetic acid and generally must be very low in wines chosen for consumption. From the above histograms, the three acidity variables (fixed acidity, volatile acidity and citric acid) seem to have long-tailed distributions, while pH does not. Far outliers were noticed for fixed acidity, volatile acidity and citric acid. For example, even though the mean citric acid content was only 0.3342, the maximum value in the dataset was 1.66, about five times the mean. But the number of white wines showing these extreme variations from central tendency was low.
Histograms for visualising distribution of other variables measuring chemical composition
Next, the other variables measuring the chemical composition of white wines were plotted as histograms. As before, summaries of descriptive statistics were obtained to decide bin widths and axes.
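The outlier observation can be checked against the summaries alone. The sketch below assumes that "far outlier" is meant in Tukey's sense (a value more than 3 times the interquartile range above the third quartile); the report does not state which definition was used, and only the upper tail is checked. The quartiles and maxima are copied from the summaries above.

```r
# Quartiles and maxima copied from the summaries above.
acidity <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid", "pH"),
  q1  = c(6.30, 0.21, 0.27, 3.09),
  q3  = c(7.30, 0.32, 0.39, 3.28),
  max = c(14.20, 1.10, 1.66, 3.82)
)

# Assumed definition: upper far fence = Q3 + 3 * IQR (Tukey).
iqr <- acidity$q3 - acidity$q1
acidity$far_fence <- acidity$q3 + 3 * iqr

# Does the maximum lie beyond the far fence?
acidity$has_far_outlier <- acidity$max > acidity$far_fence
print(acidity)
```

Under this assumed definition the maximum exceeds the fence for the three acidity variables but not for pH, which is consistent with the reading of the histograms.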
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800

From the above histograms, all 4 variables also seem to have long-tailed distributions. Far outliers were noticed for all variables, indicating significant variation in content. For example, free sulfur dioxide ranges from 2 to 289. But the closeness of the median and the mean indicates that white wines showing extreme variations from central tendency were rare. On the whole, salt content (chlorides) in the white wines tends to be low, while sulfur dioxide and dissolved sulphates, which indicate preservative content, tend to be around a standard concentration. The few outliers could have compromised taste.
Histograms for visualising distribution of interdependent variables, alcohol, residual sugar and density of wines using the function, get_histogram
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The summaries and plots indicate that three quarters of the white wines have residual sugar levels of 9.9 g/dm3 or less (the third quartile), but many outliers are seen, with a maximum of 65.8 g/dm3. The range is narrower for density, while alcohol has a more spread-out distribution. Residual sugar is the sugar left over after fermentation, while alcohol is produced by fermenting the sugar. Both influence density. These will be useful variables to analyse in the bivariate plots section.
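The definition of get_histogram is not shown in this report, so the following is only a guess at its general shape, assuming ggplot2 (suggested by the `stat_bin()` messages). The arguments `data`, `variable`, `binwidth` and `x_label` are hypothetical. Passing `binwidth` explicitly is also what silences the `bins = 30` message printed above.

```r
library(ggplot2)

# Hypothetical reconstruction: the original get_histogram is not shown,
# so the signature and styling here are assumptions.
get_histogram <- function(data, variable, binwidth, x_label = variable) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_histogram(binwidth = binwidth, colour = "black", fill = "grey70") +
    labs(x = x_label, y = "Number of wines")
}

# Toy data: the first ten alcohol values from the str() output.
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

p <- get_histogram(toy, "alcohol", binwidth = 0.5, x_label = "Alcohol")
print(p)
```

A bin width can be picked from the summaries, for instance a fraction of the interquartile range, so that the bulk of the distribution spans a reasonable number of bins.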
Ratings to bin quality of white wines
To bin white wines by quality, a rating system was introduced that groups quality into 3 categories: Low (Quality = 3 or 4), Average (Quality = 5 or 6) and High (Quality = 7, 8 or 9). A new variable, “Rating”, was created in the data frame.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## Low Average High
## 183 3655 1060

As expected from the distribution of quality, most wines are rated ‘Average’, with very few rated ‘Low’. ‘High’ rated wines made up less than a fourth (about 22%) of the white wines analysed in this dataset.
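One way to create such a rating variable is with `cut()`, sketched below. The data frame name `wines` is an assumption, and the quality scores are rebuilt from the per-score counts shown above so that the example is self-contained; the original code may have used a different approach.

```r
# Rebuild the quality scores from the per-score counts shown above.
wines <- data.frame(
  quality = rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))
)

# Intervals are right-closed by default: (2,4] = Low, (4,6] = Average,
# (6,9] = High.
wines$Rating <- cut(
  wines$quality,
  breaks = c(2, 4, 6, 9),
  labels = c("Low", "Average", "High")
)

table(wines$quality)
table(wines$Rating)
```

`cut()` returns a factor whose levels follow the order of `labels`, so tables and plots list Low, Average and High in that order rather than alphabetically.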
Univariate Analysis
Structure and attributes of dataset
My dataset is a data frame with 4898 observations of 13 variables. The variables are Wine ID, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality for 4898 white wines.
Main features of interest
To me, the main feature of interest in the dataset is that a sensory measure (quality) of white wines is an outcome of measurable properties. How a complex interplay of physical and chemical properties affects the taster’s perception of quality is the major point of interest.
Other features
The other 11 variables represent a combination of physical (density) and chemical (alcohol, sugars, pH) properties of the white wines. Some of the variables are known to be interdependent, such as fixed acidity, citric acid and pH, as well as density, residual sugar and alcohol. These known associations make it easier to analyse how the main feature, quality of the white wines, is altered by physical and chemical properties.